Comparison of Three Boosting Models on Parkinson's
Prediction
Context:
Parkinson’s disease (PD) is a disabling brain
disorder that affects movements, cognition, sleep,
and other normal functions. Unfortunately, there
is no current cure—and the disease worsens
over time. It's estimated that by 2037, 1.6
million people in the U.S. will have Parkinson’s
disease, at an economic cost approaching $80
billion. Research indicates that protein or peptide
abnormalities play a key role in the onset and
worsening of this disease [1].
Overview:
Three tree-based ensemble models are compared for
predicting the categorical UPDRS rating of
Parkinson's symptoms. The models compared are
XGBoost, LightGBM, and CatBoost.
Additionally, the data is filtered to the first
12 months of visits for improved performance, and a
few engineered features are added. The target
shows a significant class imbalance, so
SMOTE (Synthetic Minority Oversampling
Technique) is used to balance the target for training.
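The oversampling step can be illustrated with a minimal sketch of the SMOTE idea, assuming feature vectors stored as plain Python lists. Each synthetic sample is placed at a random point on the segment between a minority-class sample and one of its nearest minority-class neighbours. The function name `smote_oversample`, the toy data, and `k=2` are illustrative assumptions; the source does not state which SMOTE implementation or settings were used, and this simplified variant draws the base sample at random.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up toy data: 6 majority-class samples vs 3 minority-class samples.
majority = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4], [0.4, 0.1]]
minority = [[2.0, 2.0], [2.5, 2.2], [2.2, 2.8]]

new_points = smote_oversample(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Oversampling of this kind is normally applied to the training split only, so that the evaluation data contains no synthetic samples.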
Model hyperparameter tuning is performed using
the hyperopt package, which uses Bayesian
optimization to explore the hyperparameter
search space. Lastly, model performance is
compared with and without the information on
whether the patient was on medication during the
clinical visit. The medication information has many
missing values but shows predictive
improvement in UPDRS 1 and UPDRS 3.
AUC-ROC is used as the main comparison metric
between models. The probability threshold for
classification is tuned to favor Recall while also
considering the highest F1 score. Recall is favored over
Precision to minimize False Negatives, which
could cause patients to delay seeking treatment.
False Positives also have a negative
impact on a patient, who will likely have
more frequent doctor's visits, but that impact is less
severe than a “likely” Parkinson's patient being
misclassified as “not at risk.”
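The threshold selection described above can be sketched as follows. The exact selection rule is not given in the source, so this example assumes one plausible rule: keep only thresholds whose Recall meets a minimum floor, then pick the one with the highest F1. The function names, the `min_recall=0.8` floor, the candidate grid, and the toy labels and probabilities are all hypothetical.

```python
def recall_and_f1(y_true, y_prob, threshold):
    """Return (recall, f1) when probabilities >= threshold are labelled positive."""
    tp = fp = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1


def pick_threshold(y_true, y_prob, min_recall=0.8):
    """Assumed rule: among thresholds meeting the recall floor, maximize F1."""
    candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    scored = [(t, *recall_and_f1(y_true, y_prob, t)) for t in candidates]
    # Fall back to all candidates if none reaches the recall floor.
    feasible = [row for row in scored if row[1] >= min_recall] or scored
    return max(feasible, key=lambda row: (row[2], row[1]))


# Made-up validation labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.30, 0.35, 0.40, 0.55, 0.60, 0.65, 0.70, 0.20, 0.90]

best_threshold, best_recall, best_f1 = pick_threshold(y_true, y_prob)
print(best_threshold, best_recall, round(best_f1, 3))
```

Setting a recall floor first and maximizing F1 second keeps False Negatives bounded while still penalizing thresholds that flag nearly every patient.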
The default model parameters with no SMOTE and
no medication data gave an AUC-ROC of around
0.59. The best performance for UPDRS 1 is an
AUC-ROC of 0.796 from a CatBoost Classifier using
the Hyperopt hyperparameters and the data with
SMOTE applied. The best performance for
UPDRS 2 is an AUC-ROC of 0.881 from a CatBoost
Classifier using the Hyperopt hyperparameters,
with the medication data, and with SMOTE
applied. The best performance for UPDRS 3 is an
AUC-ROC of 0.729 from a LightGBM Classifier
using the Hyperopt hyperparameters, with the
medication data, and with SMOTE applied.
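As an aid to reading the AUC-ROC figures above, the metric can be computed directly from its ranking interpretation: it is the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The sketch below is a plain-Python illustration with made-up labels and scores; the function name `roc_auc` is an assumption and is not the code used to produce the reported results.

```python
def roc_auc(y_true, y_score):
    """AUC-ROC as the share of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative sample")
    # 1 point for a correctly ordered pair, 0.5 for a tie, 0 otherwise
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; library implementations use sorting to get the same value more efficiently.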